ABSTRACT

The effect of stratified sampling of items on the 
estimation of test score distribution parameters by multiple matrix 
sampling was studied. Item difficulty and/or interitem correlations 
were the bases of stratification. Various item universes were created
by computer simulation and sampled according to several plans. The
results indicate that stratification of items does not consistently
improve the stability of parameter estimation. The results also show
that the variance estimate used in many studies is biased for some 
item universes when difficulty stratification is used. A variance 
estimate developed in the current study removes this bias. 
(Author) 
The Effect of Item Stratification
in Multiple Matrix Sampling¹

Matrix sampling consists of a sample of n examinees 
responding to a subtest of m items. The results of this subtest 
administration are used to estimate parameters of the test score
distribution that would result if the population of N examinees
responded to the universe of M items. When several n-by-m samples
are used for this estimation, the procedure is called multiple
matrix sampling. The mean of the estimates from the several
matrix samples is presented as the estimate of the test score 
distribution parameter. 

The usefulness of multiple matrix sampling has been 
demonstrated by several authors including Lord (1962) and Plumlee
(1964). Once the efficacy of this method was shown, one of the
main questions that needed to be answered was which sampling plan
produced the most stable parameter estimates. Shoemaker (1970,
1971) investigated this question by varying the sizes and numbers
of the item and examinee samples. Defining an observation as
one examinee's response to one item, he concluded that increasing
the number of observations improved the stability of the estimates. 
He also stated that in estimating the mean it was best to use many 
small item samples. 

In other studies of sampling plans Kleinke (1969, 1972)
tried item stratification to improve the stability of estimation
from multiple matrix sampling. He stratified items on the basis
of content, difficulty, and a combination of both, and concluded
that stratification did not improve the stability of the estimates
of the mean and variance beyond that attained using simple
random sampling of items. However, Kleinke sampled from only one
data base. He suggested that stratified sampling of item universes
with a variety of combinations of item difficulties and interitem 
correlations be investigated before a conclusion is reached con- 
cerning item stratification in multiple matrix sampling. This 
investigation was carried out in the study presented below. 

¹This paper is based on the author's Ph.D. dissertation
submitted to the faculty of the Graduate School of Syracuse University.
Theoretical Framework

Rajaratnam, Cronbach, and Gleser (1965) derived equations
for estimating the coefficient of generalizability from a model that
took stratification into account. Two studies (Cronbach, Schonemann,
and McKie, 1965; Shoemaker and Osburn, 1968) have demonstrated
that for stratified sampling of items these equations produced more
accurate estimates of the coefficient than equations that were
derived from a model that did not consider stratification. In the
present study an equation for estimating the variance of a test
score distribution based on stratification of items was derived.
The development of this equation followed closely the method used by
Sirotnik (1970) in deriving the equation for estimating the variance
in matrix sampling without item stratification. Sirotnik based his
derivation on a two-way analysis of variance model, the two factors
being examinees and items. Through algebraic manipulation of the
expected mean squares from this model he derived the equation

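$$\hat{\sigma}^2_y = \frac{n(N-1)}{NM(n-1)(m-1)} \left[ m(M-1)\,s^2_y - (M-m)\,\bar{s}^2_j \right] \qquad [1]$$

where $\hat{\sigma}^2_y$ = estimated variance of the test score distribution of
proportion correct scores;

$s^2_y$ = sample variance of examinee proportion correct
scores;

$\bar{s}^2_j$ = mean of sample item variances.

This equation had earlier been derived by Lord (1960) using a
method based on bipolykays.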

If the items are stratified, the appropriate analysis
of variance model is a split-plot design with items nested within
strata and completely crossed by examinees. The equation for
estimating the variance from this model is

$$\hat{\sigma}^2_{ys} = \frac{n(N-1)}{NM(n-1)(m-H)} \left[ M(m-H)\,s^2_y - (M-m)\,\bar{s}^2_j + \frac{M-m}{m} \sum_{h=1}^{H} m_h\, s^2_{y(h)} \right] \qquad [2]$$

where $\hat{\sigma}^2_{ys}$ = the appropriate estimator of the variance of the
test score distribution of proportion correct scores
when items are sampled by strata;

$m_h$ = the number of items sampled from stratum h;

$s^2_{y(h)}$ = sample variance of examinee proportion correct
scores within stratum h;

H = number of strata in sample or universe.
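
As a rough sketch of how the two variance estimators differ in use, the Python fragment below computes both from an n-by-m matrix of 0/1 scores. It is not the study's Fortran code. The function names are hypothetical, sample variances are assumed to use divisor n, and the two formulas are written as they follow from the expected mean squares of the two-way and split-plot designs (see the comments); they should be read as assumptions rather than as a transcription of the author's program.

```python
def pvar(values):
    """Variance with divisor len(values) (assumed convention, not len - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_estimates(scores, strata, N, M):
    """scores: n rows (examinees), each a list of m 0/1 item scores.
    strata: stratum label for each of the m sampled items.
    N, M: sizes of the examinee population and the item universe.
    Returns (unstratified estimate, stratified estimate)."""
    n, m = len(scores), len(scores[0])
    labels = sorted(set(strata))
    H = len(labels)

    # Variance of examinee proportion correct scores, and mean item variance.
    s2_y = pvar([sum(row) / m for row in scores])
    s2_j = sum(pvar([row[j] for row in scores]) for j in range(m)) / m

    # Sum over strata of m_h times the variance of within-stratum scores.
    within = 0.0
    for h in labels:
        cols = [j for j in range(m) if strata[j] == h]
        y_h = [sum(row[j] for j in cols) / len(cols) for row in scores]
        within += len(cols) * pvar(y_h)

    lead = n * (N - 1) / (N * M * (n - 1))
    # Assumed unstratified form: m(M-1) s2_y - (M-m) s2_j, divided by (m-1).
    unstratified = lead / (m - 1) * (m * (M - 1) * s2_y - (M - m) * s2_j)
    # Assumed stratified form: adds the within-stratum term, divides by (m-H).
    stratified = lead / (m - H) * (
        M * (m - H) * s2_y - (M - m) * s2_j + (M - m) / m * within
    )
    return unstratified, stratified
```

With a single stratum the two estimates coincide, which gives a convenient check on any implementation of this kind.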

Methodology

Computer simulation of examinees responding to dichoto- 
mously scored items produced the universes that were investigated. 
The study was carried out using programs written in Fortran IV 
and run on an IBM System/370 computer. The distributions of item
difficulties and interitem tetrachorics were manipulated to
produce a variety of universes. The tetrachorics were used to
simulate content strata.

Three distributions of item difficulties were used - 
rectangular, normal, and negatively skewed. The negatively skewed 
distribution was a reflection of a chi-square curve with three 
degrees of freedom. All three distributions were limited to 
difficulties between .1 and .9. Item difficulties for each item 
universe were pseudo-randomly sampled from these distributions. 
The difficulty strata were established by ranking the difficulties 
from low to high and then dividing this ranking into quarters. 
It should be noted that the item universes that were created by
simulation did not exactly meet the specifications discussed 
above because each examinee population consisted of only 1000 
examinees. The first four moments of the distributions that were 
produced were well within the expected range of error. The accu- 
racy of some of these moments could not be determined precisely 
because the curves for the normal and skewed distributions had 
closed, not infinite, tails. 
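
The ranking-and-quartering step can be sketched as follows. Only the rectangular case is drawn here, because the parameters of the normal and skewed distributions are not given; the function name, the seed, and the use of Python's uniform generator are illustrative assumptions.

```python
import random


def difficulty_strata(difficulties, n_strata=4):
    """Rank items from low to high difficulty value and cut the ranking
    into n_strata equal-sized groups. Returns one stratum index per item."""
    order = sorted(range(len(difficulties)), key=lambda j: difficulties[j])
    size = len(difficulties) // n_strata
    strata = [0] * len(difficulties)
    for rank, j in enumerate(order):
        # Any leftover items (when the count is not divisible) join the top stratum.
        strata[j] = min(rank // size, n_strata - 1)
    return strata


rng = random.Random(0)  # arbitrary seed, for reproducibility only
# Rectangular case: 48 difficulties drawn uniformly between .1 and .9.
p = [rng.uniform(0.1, 0.9) for _ in range(48)]
strata = difficulty_strata(p)
```

Each of the four strata then holds 12 items, and every item in a lower stratum has a difficulty value no greater than any item in the next stratum.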

Each distribution of difficulties was paired with each 
of three sets of interitem tetrachorics to form different universes. 
Twenty-four universes were studied. The within strata and among 
strata tetrachorics for the three sets were, respectively, .3 and .3, 
.5 and .3, and .5 and 0. The last set of correlations is not likely 
to be found on a mental test but was included in the study to deter- 
mine the effect of such pure content strata on the stability of 
parameter estimation in multiple matrix sampling. There were four 
content strata in each universe. 

In generating examinee responses to items the assumption
was made that the underlying ability of examinee i on the skill
being measured by item j within stratum h is represented by the
linear model for the split-plot analysis of variance design.
This model is

$$X_{ij(h)} = \mu + \alpha_i + \beta_h + \gamma_{j(h)} + (\alpha\beta)_{ih} + (\alpha\gamma)_{ij(h)} \qquad [3]$$

where $X_{ij(h)}$ = the ability level of examinee i on the skill
being measured by item j within stratum h;

$\mu$ = general effect, equivalent to the matrix population
mean;

$\alpha_i$ = the effect of examinee i;

$\beta_h$ = the effect of stratum h;

$\gamma_{j(h)}$ = the effect of item j within stratum h;

$(\alpha\beta)_{ih}$ = the interaction between examinee i and stratum h;

$(\alpha\gamma)_{ij(h)}$ = the interaction between examinee i and item
j within stratum h.

To produce items with the desired tetrachorics it was necessary
to generate these underlying ability levels for the items as
multivariate normal variables with product-moment correlations
equal to the specified tetrachorics. The correlation of abilities
tested by any two items, j and j*, is represented by the correlation
between $X_{ij(h)}$ and $X_{ij^*(h^*)}$ (h = h* and j = j* may be true)
across the examinees. This will be indicated by $r_{jj^*}$. This
correlation between two sums is affected by the examinee related
components of equation 3: $\alpha_i$, $(\alpha\beta)_{ih}$, and $(\alpha\gamma)_{ij(h)}$.
The covariances of $\beta_h$ and $\beta_{h^*}$ and of $\gamma_{j(h)}$ and
$\gamma_{j^*(h^*)}$ are equal to zero since these factors are constant
for all values of i. The $\alpha$ and $\alpha\beta$ factors
were generated as normally distributed variables with means equal
to zero and variances of one. The covariance of the $\alpha$ factors was
equal to one since the effect of examinee i was the same for all
items. Since the $\beta_h$ factors were fixed effects it was determined
that the covariance of $(\alpha\beta)_{ih}$ and $(\alpha\beta)_{ih^*}$ was equal to 1/3
for all h ≠ h*. This has been proven by Searle (1971, pp. 400-402). Using
the values discussed above and setting the value of $\sigma^2_{\alpha\gamma}$ equal to
five, the equation for $r_{jj^*}$ could be solved for the values of $r_{\alpha\gamma}$
(the correlation between $(\alpha\gamma)_{ij(h)}$ and $(\alpha\gamma)_{ij^*(h^*)}$) needed to
produce the desired values of $r_{jj^*}$. Establishing $\sigma^2_{\alpha\gamma}$ equal to
five was necessary to produce values of $r_{\alpha\gamma}$ that would form a proper
correlation matrix. The $\alpha\gamma$ terms were normally distributed with means
equal to zero. The sum of the three examinee related terms
produced the examinee ability level. These ability levels were
standardized to normal variables with means equal to zero and
variances equal to one.

The pseudo-random generation of multivariate normal 
deviates was performed by first generating a row vector, U, of 
independent normal deviates using a method outlined by Meyer (1969). 
This vector was then transformed to V = UT, a vector that was, in
effect, sampled from a population of vectors whose elements have 
specified correlations. The transformation matrix, T, is the 
upper triangular matrix that results from the square-root decompo- 
sition of the correlation matrix of the variables being generated. 
This method of generation was outlined by Parr and Slezak (1972). 
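
A minimal sketch of the transformation V = UT follows. It assumes Python's built-in normal generator in place of the Meyer (1969) method, which is not described here in enough detail to reproduce. The helper names and the example 3-by-3 correlation matrix are hypothetical.

```python
import math
import random


def upper_cholesky(R):
    """Return the upper triangular T with T'T = R, for a symmetric
    positive definite correlation matrix R (square-root decomposition)."""
    k = len(R)
    T = [[0.0] * k for _ in range(k)]
    for i in range(k):
        T[i][i] = math.sqrt(R[i][i] - sum(T[a][i] ** 2 for a in range(i)))
        for j in range(i + 1, k):
            cross = sum(T[a][i] * T[a][j] for a in range(i))
            T[i][j] = (R[i][j] - cross) / T[i][i]
    return T


def correlated_normals(T, rng):
    """Draw a row vector U of independent standard normals; return V = U T."""
    k = len(T)
    U = [rng.gauss(0.0, 1.0) for _ in range(k)]
    # T is upper triangular, so only rows 0..j contribute to element j.
    return [sum(U[a] * T[a][j] for a in range(j + 1)) for j in range(k)]


# Illustrative matrix: two items in one stratum (.5) and one in another (.3).
R = [[1.0, 0.5, 0.3],
     [0.5, 1.0, 0.3],
     [0.3, 0.3, 1.0]]
T = upper_cholesky(R)
V = correlated_normals(T, random.Random(1))
```

Because the covariance matrix of V is T'T, vectors drawn this way have population correlations equal to R.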

The continuous ability levels were dichotomized by
comparing the examinee's ability score for item j to $z_j = \Phi^{-1}(1 - p_j)$,
where $\Phi$ is the standard normal distribution function and $p_j$ is the
difficulty of item j. If examinee i's ability level on item j was
greater than or equal to $z_j$, examinee i was considered to be
successful on item j and was given a score of one for that item.
If the ability level was less than $z_j$ the score was zero. This
method produced items with resulting difficulties that were
extremely close to the values that had been originally generated 
to form the item universes. About 85% of the items had difficulties
that were within .02 of these original values. The tetrachorics
that resulted from this method, for the universes of 1000 examinees,
were generally close to the values specified. Eighty-four percent
of a 2.5% pseudo-random sample of the correlations were within
± .07 of the specified values.
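
The dichotomization rule can be illustrated in a few lines, assuming that difficulty is expressed as the proportion of examinees answering the item correctly. The function names are hypothetical; `statistics.NormalDist` from the Python standard library supplies the inverse of the standard normal distribution function.

```python
from statistics import NormalDist


def threshold(p):
    """Cutoff that a standard normal ability level reaches or exceeds
    with probability p, where p is the item difficulty."""
    return NormalDist().inv_cdf(1.0 - p)


def score(ability, p):
    """Return 1 if the ability level is at or above the item's cutoff, else 0."""
    return 1 if ability >= threshold(p) else 0
```

An easy item (difficulty .9) has a cutoff well below zero, so an examinee with a standardized ability level of -1.0 still passes it, while the same examinee fails an item of difficulty .5.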

For each combination of difficulties, tetrachorics,
and type of stratification three multiple matrix sampling plans
were studied to see what effect differing item sample sizes would
have on the estimation of the mean and variance. The three plans
divided the 48-item universes into 3, 6, or 12 samples. The items
were exhaustively sampled. The types of sampling studied were
simple random, difficulty stratified, content (tetrachoric)
stratified, and combined difficulty and content stratified. Each
item sample was matched with an examinee sample of 16. In none
of the plans were the examinees exhaustively sampled. Each sampling
plan was replicated 500 times for each universe.
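
One way such exhaustive plans might be drawn is sketched below. The round-robin dealing of shuffled items is an assumption made for illustration; the text does not say how items were assigned to samples. Strata are assumed to be of equal size, as with four strata of 12 items in a 48-item universe.

```python
import random


def exhaustive_item_samples(strata, n_samples, rng, stratified=True):
    """Partition all items into n_samples disjoint subtests.
    With stratified=True each subtest receives an equal share of every
    stratum; otherwise items are dealt out by simple random sampling."""
    items = list(range(len(strata)))
    if stratified:
        groups = [[j for j in items if strata[j] == h]
                  for h in sorted(set(strata))]
    else:
        groups = [items]
    samples = [[] for _ in range(n_samples)]
    for group in groups:
        rng.shuffle(group)
        # Deal the shuffled items to the subtests in turn.
        for position, j in enumerate(group):
            samples[position % n_samples].append(j)
    return samples


strata = [j // 12 for j in range(48)]  # four strata of 12 items each
plans = {k: exhaustive_item_samples(strata, k, random.Random(k))
         for k in (3, 6, 12)}
```

Under these assumptions the three plans give subtests of 16, 8, and 4 items, containing 4, 2, and 1 items from each stratum respectively.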

To judge which item sampling plan and which variance
estimator produced the best estimates, the mean squared errors
(MSEs) of the estimates were compared. Negative estimates of the
variance were included in the computation of the MSEs because the 
main purpose of the study was to investigate the stability of the 
parameter estimates. Not including the negative estimates would 
have distorted the estimates of this stability. 
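
For concreteness, the comparison criterion can be computed as in the sketch below. The function names are hypothetical, and no values from the study are reproduced.

```python
def mean_squared_error(estimates, parameter):
    """Average squared deviation of the estimates from the true parameter.
    Negative variance estimates are kept in the list, not discarded."""
    return sum((e - parameter) ** 2 for e in estimates) / len(estimates)


def bias(estimates, parameter):
    """Mean of the estimates minus the true parameter."""
    return sum(estimates) / len(estimates) - parameter
```

Dropping the negative estimates before calling `mean_squared_error` would shrink the spread of the retained values and so understate the instability of the estimator.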

Results and Discussion 

The results indicate that stratification of items does 
not consistently improve the stability of the estimation of the 
mean and variance in multiple matrix sampling for the item universes 
and sampling plans studied. In a few cases there may be evidence 
favoring stratification. However, with one possible exception, no 
trend emerges to indicate that stratification of items should be 
recommended on statistical grounds. 

Presented in Table 1 are the results for estimating the
mean. The means ($\hat{\mu}$) and the MSEs for each distribution of 500
estimates are presented along with the means ($\mu$) of the test score
distributions. In 16 of the 24 universes studied, stratification
of items on the basis of difficulty produced a smaller MSE than
simple random sampling of items. However, no distribution of
difficulties, no set of tetrachorics, nor any sampling plan had
a systematically lower MSE for stratification. In only 7 of 18
cases was the MSE from item sampling with content stratification
less than from simple random sampling of items. The only systematic
improvement was found with simultaneous stratification of content
and difficulty where stratification produced smaller MSEs for all
6 universes.

The results for estimating the variance are shown in
Tables 2 and 3. The means ($\hat{\sigma}^2_y$ or $\hat{\sigma}^2_{ys}$) and MSEs for each
distribution of 500 estimates are presented along with the variances
($\sigma^2_y$) of the test score distributions. Difficulty stratification and
simultaneous difficulty and content stratification were sometimes
accompanied by a negative bias when $\hat{\sigma}^2_y$ was used to estimate the
variance. The bias increased as the difference between the
interitem correlations within strata and among strata increased. There
was no bias when the within and among correlations were equal. The
bias was removed by $\hat{\sigma}^2_{ys}$. There was not any bias in the variance
estimates of $\hat{\sigma}^2_y$ when content stratification was used.

When the derivations of $\hat{\sigma}^2_y$ and $\hat{\sigma}^2_{ys}$ are compared, the
algebraic representation of the bias in $\hat{\sigma}^2_y$ can be seen. Sirotnik (1970)
showed that

$$\sigma^2_y = \frac{E[MS(\text{exam.})]}{m} - \frac{(1 - m/M)\, E[MS(\text{exam. by items})]}{m} \qquad [4]$$

when the two-way analysis of variance design is used. When the
split-plot design is appropriate, the second term on the right
side of equation 4 becomes

$$\frac{(1 - m/M)\, E[MS(\text{exam. by items within strata})]}{m}. \qquad [5]$$

The first term remains the same. The relationship between
expression 5 and the second term on the right side of equation 4
can be determined from the equality

SS(exam. by items) = SS(exam. by strata) +
SS(exam. by items within strata).   [6]
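
The equality above can be checked numerically on any score matrix. The sketch below computes the three interaction sums of squares directly from their definitions; the function name and the data layout (a list of rows plus a stratum label for each item) are assumptions made for illustration.

```python
def interaction_sums_of_squares(scores, strata):
    """Return (SS exam. by items, SS exam. by strata,
    SS exam. by items within strata) for an n-by-m score matrix
    whose columns carry the stratum labels in `strata`."""
    n, m = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * m)
    row = [sum(r) / m for r in scores]                       # examinee means
    col = [sum(r[j] for r in scores) / n for j in range(m)]  # item means
    cols = {h: [j for j in range(m) if strata[j] == h] for h in set(strata)}
    # Examinee-by-stratum cell means and overall stratum means.
    cell = {h: [sum(r[j] for j in c) / len(c) for r in scores]
            for h, c in cols.items()}
    smean = {h: sum(v) / n for h, v in cell.items()}

    ss_items = sum((scores[i][j] - row[i] - col[j] + grand) ** 2
                   for i in range(n) for j in range(m))
    ss_strata = sum(len(cols[h]) * (cell[h][i] - smean[h] - row[i] + grand) ** 2
                    for h in cols for i in range(n))
    ss_within = sum((scores[i][j] - cell[strata[j]][i]
                     - col[j] + smean[strata[j]]) ** 2
                    for i in range(n) for j in range(m))
    return ss_items, ss_strata, ss_within
```

The first value should equal the sum of the other two up to rounding error, whether or not the strata contain equal numbers of items.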

After determining the expected values of both sides of equation
6, the bias in $\hat{\sigma}^2_y$ when stratified sampling is appropriate can be
shown to be

[expression 7 is not recoverable from the source; only the fragment m(H-1) is legible]   [7]

As the sampling fraction for items decreases the second term in 
expression 7 becomes dominant, increasing the negative bias. The 
results in Table 2 verify this statement. The reasons that cer- 
tain interitem correlations affect this bias have not been deter- 
mined. Future investigation of this problem is needed. 

In general, stratified sampling is beneficial compared 
to simple random sampling when it establishes a sampling plan that 
can force similarity among samples and thereby control a large 
portion of the variance across the samples. Item stratification, 
as done in the present study, does not do this. It would be 
possible to control more variance across samples if examinees
were also stratified. However, the complexity of such a sampling
plan may make it impractical.

Another possible way to improve the stability of estima- 
tion in multiple matrix sampling might be to sample more items 
from strata with larger variances of item difficulties. Cochran
(1963, p. 96) has shown that in the usual one dimensional sampling,
larger samples should be taken from strata with larger variances.
However, the results of the present study seem to indicate that
this probably will not reduce MSEs in multiple matrix sampling.
The normal and skewed distributions of items had strata with
unequal variances of item difficulties. If these unequal variances
had an effect on the MSEs of the estimates from universes with
normal and skewed distributions of difficulties, the evidence
presented by Cochran indicates that stratified sampling would
have produced consistently larger MSEs than simple random sampling
for these universes. The results did not show this. The proportion
of universes in which stratified sampling produced smaller MSEs was
about the same for all three distributions of difficulties. For
example, in estimating the mean for universes with rectangular
distributions, 11 of 16 sampling plans favored stratification. For
the skewed and normal distributions there were 8 of 16 and 10 of
16 plans, respectively, that favored stratification. 

The conclusion that item stratification does not improve 
the stability of parameter estimation in multiple matrix sampling 
is consistent with the conclusion presented by Kleinke (1972). 
However, as he pointed out, there may be practical considerations
that indicate stratification should be used. One such consideration
is the time needed to administer each sample of items. Certainly
most principals would not want to have a test used in their school
that would cause some students to finish long before others. There
is always going to be some variance in testing time for examinees
but stratified sampling of items can help to minimize this variance. 
Item stratification does not hurt the stability of estimation when 
the proper variance estimation is used. Thus, if practical problems 
can be solved by item stratification, it certainly should be used. 